Introduction to Web Scraping
Contents
- 1. What is Web Scraping?
 - 2. Common Types of Websites
 - 3. Static & Dynamic Websites
 - 4. What is the DOM Structure?
 - 5. Goals and Applications of Web Scraping
 - 6. Popular Python Libraries
 - 7. Real Example: Scraping Books from books.toscrape.com
 
1. What is Web Scraping?
Web scraping is the automated process of collecting data from websites using programs — instead of manually copying data line by line. We can write a few lines of code to get hundreds or thousands of data items in just minutes.
2. Common Types of Websites
Websites are often classified by several criteria:
- By dynamism: Static vs Dynamic websites
 - By frontend/backend technologies: React, Vue, Django, Laravel, etc.
 - By code architecture: Monolith, Microservices, etc.
 - By rendering technologies: SSR, CSR, hybrid
 
In this introduction, we only focus on classification by dynamism.
3. Static & Dynamic Websites
Static Websites:
- Use only HTML and CSS; content is “fixed” — doesn’t change per visitor.
 - Easy to scrape since content is already in the HTML.
 - Examples: Portfolio sites, product landing pages.
 
Dynamic Websites:
- Use backend processing — often PHP, Node.js, Python, etc.
 - Content changes based on user interaction or is loaded by JavaScript.
 - Harder to scrape because you must wait for the page to fully load (see the sketch after this list).
 - Examples: Shopee, Facebook, real-time price tracking sites.
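
As a rough illustration (the URLs below are placeholders, not real targets), a static page can be fetched with a single HTTP request, while a dynamic page may need a real browser to execute its JavaScript first:

import requests
from selenium import webdriver

# static site: the HTML response already contains the content
html = requests.get("https://example.com/static-page").text

# dynamic site: let a real browser run the JavaScript, then read the rendered page
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-page")
html_after_js = driver.page_source
driver.quit()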
 
4. What is the DOM Structure?
The DOM (Document Object Model) is a tree structure of a web page. Each HTML tag is a node, which can be a parent or child of other nodes.
Simple example:
<body>
  <h1>Title</h1>
  <p>Description here</p>
</body>
In this example:
- <body> is the parent node
 - <h1> and <p> are its children
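
A quick way to see this parent/child relationship is to parse the snippet with BeautifulSoup (a minimal sketch; variable names are arbitrary):

from bs4 import BeautifulSoup

html = """
<body>
  <h1>Title</h1>
  <p>Description here</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")
print(h1.parent.name)  # body -> the parent node of <h1>

# direct children of <body>: the <h1> and <p> tags
print([child.name for child in soup.body.find_all(recursive=False)])  # ['h1', 'p']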
Larger DOM example:
<html>
  <head>
    <title>Page A</title>
  </head>
  <body>
    <div class="header">
      <h1>Welcome</h1>
    </div>
    <div class="content">
      <ul>
        <li>Book 1</li>
        <li>Book 2</li>
      </ul>
    </div>
    <footer>Contact</footer>
  </body>
</html>
- <html> is the root node containing the entire webpage.
 - <head> contains page information such as the title; it is not directly visible.
 - <title> is the page title shown on the browser tab.
 - <body> holds the main content visible to users.

Inside <body>, there are smaller parts called child nodes:
- <div class="header"> contains the main header <h1>Welcome</h1>.
 - <div class="content"> contains a list of books with <li> items.
 - <footer> is the footer section with the text “Contact”.
This structure is like a tree: each tag is a branch or a leaf, which makes it easy to find and extract data when scraping websites.
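
To make the tree idea concrete, here is a small sketch (assuming BeautifulSoup) that parses the HTML above and walks down to the <li> items:

from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Page A</title></head>
  <body>
    <div class="header"><h1>Welcome</h1></div>
    <div class="content">
      <ul>
        <li>Book 1</li>
        <li>Book 2</li>
      </ul>
    </div>
    <footer>Contact</footer>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)                    # Page A
books = soup.select("div.content ul li")  # walk the tree with a CSS selector
print([li.text for li in books])          # ['Book 1', 'Book 2']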
5. Goals and Applications of Web Scraping
🎯 Main goals:
- Automate data collection (fast, save effort)
 - Analyze and compare prices (products, crypto, flight tickets, etc.)
 - Track content changes (news, prices, rankings, etc.)
 - Create datasets for research, machine learning, statistics
 - Integrate into internal systems like dashboards or apps
 
6. Popular Python Libraries
- requests – Send HTTP requests, fetch HTML content
 - BeautifulSoup (bs4) – Easy HTML parsing and extraction
 - lxml – Fast and powerful parser for HTML/XML
 - selenium – Automate interaction with dynamic (JS) sites
 - scrapy – Framework for large crawling projects
 - httpx – Similar to requests but supports async
 - pyppeteer, playwright – Headless browser control, good for JS-heavy sites
🛠 Choose libraries based on your goals. For static sites, requests + BeautifulSoup is usually enough.
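
As an aside, httpx looks almost like requests but can also be used asynchronously (a minimal sketch; the homepage URL is just an example target):

import asyncio
import httpx

async def fetch(url: str) -> str:
    # httpx mirrors the requests API but also works inside async code
    async with httpx.AsyncClient() as client:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.text

html = asyncio.run(fetch("https://books.toscrape.com/"))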
7. Real Example: Scraping Books from books.toscrape.com
The site books.toscrape.com is a sample site for practicing web scraping.
- It is a static website, ideal for beginners
 - Contains 1000 books spread across 50 pages
 - Simple URL structure:
 
https://books.toscrape.com/catalogue/page-{page_number}.html
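
Since there are 50 pages, the full list of page URLs can be generated directly from this pattern (a small sketch):

# build the URL for each of the 50 catalogue pages
urls = [
    f"https://books.toscrape.com/catalogue/page-{page_number}.html"
    for page_number in range(1, 51)
]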
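The scraping itself comes down to a few lines (a minimal sketch; here url points at page 1 of the catalogue, and each step is explained right below):

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/catalogue/page-1.html"

res = requests.get(url)                        # fetch the HTML of the page
soup = BeautifulSoup(res.text, 'html.parser')  # parse it into a DOM-like tree

# each book on the page sits in an element with these grid classes
books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3")
print(len(books))  # 20 books per catalogue page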
- res = requests.get(url): Sends an HTTP request to get the webpage content at the given URL.
 - soup = BeautifulSoup(res.text, 'html.parser'): Parses the HTML content of the page using BeautifulSoup for easier processing.
 - books = soup.select(".col-xs-6.col-sm-4.col-md-3.col-lg-3"): Selects all HTML elements with the class "col-xs-6 col-sm-4 col-md-3 col-lg-3"; these are the tags containing the information for each book on the page.
Each element in books is a “node” containing detailed information about a book, making it easy to extract details like title, image, rating, price, etc.
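
For example, a short sketch based on the markup books.toscrape.com uses (the full title sits in the link's title attribute and the price in an element with class price_color):

for book in books:
    title = book.h3.a["title"]                    # full book title
    price = book.select_one(".price_color").text  # e.g. "£51.77"
    print(title, price)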
 
 

